The holographic principle and the language of genes 
We show that the holographic principle in quantum gravity imposes a strong constraint on life.
The degrees of freedom of an organism can be estimated according to the theory of Boolean networks, 
which is constrained by the entropy bound. Hence we can explain the languages in protein sequences 
or in DNA sequences. The overall evolution of biological complexity can be illustrated. And some 
general properties of protein length distributions can be explained by a linguistic mechanism. 
INTRODUCTION

General principles that govern non-living systems also play significant roles
in living systems. How do the principles of gravity theory or quantum mechanics
bear on our understanding of life? An organism cannot remain active without a
supply of energy, owing to the first law of thermodynamics, and it cannot live
long without a supply of negative entropy, owing to the second law of
thermodynamics. But the relativity principle and the uncertainty principle seem
to have no direct effect on life. We found that the holographic principle,
which is likely only one of several independent conceptual advances needed for
progress in quantum gravity [1], profoundly constrains the forms of life and
substantially affects the evolution of life.

The holographic principle states that there is a precise, general and
surprisingly strong limit on the information content of spacetime regions. The
number of degrees of freedom (the logarithm of the number of quantum states) in
a spatial region is bounded from above by the surface area of the region
measured in units of four Planck areas. This entropy bound is a strong
constraint on any theory of our universe. If the principle is true, field
theory or string theory, in which there are infinitely many degrees of freedom,
cannot be the ultimate theory. And if the principle can be applied to the
phenomenon of life, the degrees of freedom of a living system are also
constrained. From this point of view, the principles of relativity and quantum
theory do constrain life, in an indirect way. The holographic principle
indicates a strict relationship between the information storage capacity of
space and the complexity of any organism within it. This basic idea can be
illustrated by a simple example: whatever a living system with $n$ degrees of
freedom may be, it can never exist in a universe with a horizon area less than
$4 n l_p^2$, where $l_p$ is the Planck length.
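The bound in the example above can be turned into numbers. The following is a minimal sketch, assuming the CODATA 2018 value of the Planck length; the function names are illustrative and do not come from the original text.

```python
# Assumption: CODATA 2018 value of the Planck length, in metres.
PLANCK_LENGTH_M = 1.616255e-35


def min_horizon_area(n_dof: float) -> float:
    """Smallest horizon area (m^2) allowed for n_dof degrees of freedom: 4 n l_p^2."""
    return 4.0 * n_dof * PLANCK_LENGTH_M ** 2


def max_degrees_of_freedom(area_m2: float) -> float:
    """Largest number of degrees of freedom that a horizon of the given area allows."""
    return area_m2 / (4.0 * PLANCK_LENGTH_M ** 2)


if __name__ == "__main__":
    # Horizon area needed for a system with 10^122 degrees of freedom.
    print(min_horizon_area(1e122))
```

The two functions are inverses of each other, so either the area or the number of degrees of freedom can be taken as the starting point.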

In this paper, we estimated the immense degrees of 
freedom for living systems according to the theory of 
gene regulatory networks and Boolean networks 0] Q. 
We found a contradiction between the possible degrees of freedom of living
systems and the maximum information storage capacity of the observed universe.
We then reconciled this contradiction in terms of the causality between the
possible sequences of macromolecules in actual living systems, which is
equivalent to the existence of a language of genes. We present evidence for the
language of genes, and we explain the outline of protein length distributions
by a linguistic mechanism for generating protein sequences. We can also explain
the leaps in the evolution of biological complexity according to the entropy
bound.



IMMENSE DEGREES OF FREEDOM IN LIVING 
SYSTEMS 



Information properly bridges biology and physics 
010 [HI j which gives deep insights into the nature of 
life. With the development of genetics, we know that the 
gene regulatory networks play significant roles in devel- 
opment and evolution of life 0. Based on the theory 
of self-organization, Kauffman proposed a general theory of Boolean networks to
describe gene regulatory networks, in which the interactions between genes are
represented by Boolean operations between the nodes of the network [7]. Thus,
the degrees of freedom of a living system can be estimated from the number of
states of the corresponding Boolean network. Proteins are the elementary units
in the activities of life, so a living organism can be represented by a
dynamical system of all the proteins in its body. We denote by $\mathcal{V}$
the set of all possible protein sequences up to a cutoff protein length $l$.
Proteins are chains built from 20 amino acids, so there are
$m = \sum_{k=1}^{l} 20^k$ elements in $\mathcal{V}$. We define $\mathcal{N}$ as
the Boolean network whose nodes are the elements of $\mathcal{V}$ (Fig. 1a).
According to the definition of Boolean networks, each node has two states: "on"
or "off" 0. A state of $\mathcal{N}$ specifies which nodes are "on" and which
are "off". A proteome can therefore be represented by a state of $\mathcal{N}$
in which only the nodes corresponding to the protein sequences in the proteome
are "on". The state space $\mathcal{S}$ consists of all possible states of
$\mathcal{N}$, whose number is

$$n'_s = 2^m. \qquad (1)$$

An actual species can be represented by a point in $\mathcal{S}$, and the
evolution of a species by a trajectory in $\mathcal{S}$ (Fig. 2b). As a
preliminary estimate, the degrees of freedom of a living system are given by
the logarithm of the number of states,

$$d' \sim \ln n'_s \sim 20^l \ln 2, \qquad (2)$$

which we will reconsider later on.
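The step from Eq. (1) to Eq. (2) rests on the closed form of the geometric sum that defines $m$. As a short check (the intermediate expressions are added here for illustration and are not taken from the original text):

```latex
\begin{align}
m  &= \sum_{k=1}^{l} 20^{k}
    = \frac{20^{\,l+1} - 20}{19}
    = \frac{20}{19}\left(20^{l} - 1\right)
    \approx 1.05 \times 20^{l}, \\
d' &\sim \ln n'_s = \ln 2^{m} = m \ln 2 \sim 20^{l} \ln 2 .
\end{align}
```

The prefactor $20/19$ is of order one, so for large $l$ the estimate is dominated by the longest sequences, which is why only $20^l$ appears in Eq. (2).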

According to the holographic principle, the information in the observed
universe can be calculated to be about [10]

$$I_{univ} = 10^{122}\ \text{bits}. \qquad (3)$$

This value is far larger than non-living systems require. For example, the
information of the black-body cosmic background photons is about $10^{90}$
bits, which may be the largest number of degrees of freedom of any non-living
system, yet it is still much less than $I_{univ}$. The remaining information
storage capacity of our universe is not wasted, however, because living systems
exist. The degrees of freedom of living systems are so immense that they may
exceed the maximum information storage capacity of the observed universe.

The chain structure of genetic macromolecules is what provides the immense
degrees of freedom of living systems, because the number of possible protein
sequences of length $l$ is as large as $20^l$. For a living system, the degrees
of freedom may be comparable to those of the observed universe if the protein
length is about $n^* = 94$ amino acids. Interestingly, the most frequent
protein length for life on our planet is about $n^*$. The immense degrees of
freedom of living systems originate from the great number of possible sequences
in $\mathcal{N}$. Most of the degrees of freedom come from the states of
$\mathcal{N}$ in which about half the nodes are "on". On the other hand,
degrees of freedom can also come from states in which only a minority of nodes
are "on". Actual living systems belong to the latter case, since there are only
thousands of proteins in actual proteomes.
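The value of about 94 amino acids can be checked directly. The sketch below assumes that the criterion is read as the number of sequences of a given length, 20 to the power of that length, reaching the capacity of $10^{122}$ quoted in Eq. (3); this reading and the function name are ours.

```python
I_UNIV = 10 ** 122   # information capacity of the observed universe, Eq. (3)
ALPHABET_SIZE = 20   # number of amino acids


def critical_length(capacity: int = I_UNIV, alphabet: int = ALPHABET_SIZE) -> int:
    """Smallest length n such that alphabet**n >= capacity (exact integer arithmetic)."""
    n = 0
    count = 1
    while count < capacity:
        count *= alphabet
        n += 1
    return n


if __name__ == "__main__":
    print(critical_length())
```

Python integers have arbitrary precision, so the comparison is exact and no logarithms or floating-point rounding are involved.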

ENTROPY BOUND AND THE CAUSALITY OF 
SEQUENCES 

The above estimate of the immense degrees of freedom of a living system,
however, seriously contradicts the holographic principle when we consider the
actual life around us. The average protein length in a proteome ranges from
about 250 to about 550 amino acids, and a certain number of proteins are
thousands of amino acids long. According to the preliminary estimate, the
degrees of freedom of the actual living systems on our planet would be much
larger than the maximum degrees of freedom of the observed universe,
$I_{univ}$. We have to reconcile the contradiction between the preliminary
estimate of the degrees of freedom of living systems and the conclusion of the
holographic principle. If the holographic principle is valid, we must find a
way to reduce the preliminary estimate of the degrees of freedom of living
systems.

FIG. 1: Boolean networks and the causal relationship between the
macromolecular sequences. a, The nodes of the Boolean network $\mathcal{N}$
consist of all possible protein sequences in $\mathcal{V}$ with length less
than $l$ amino acids. Each node has two states, "on" or "off", so there are
$2^m$ states for $\mathcal{N}$. The state $s_0$ is a state of $\mathcal{N}$ in
which some nodes are "on" (represented by black dots). b, Only a part of the
states of $\mathcal{N}$ may have biological meaning in an actual living system.
The number of states of $\mathcal{N}$ may exceed $I_{univ}$, but the number of
states in $\mathcal{U}$ cannot be greater than $I_{univ}$.

We introduce the causality between the states of $\mathcal{N}$ to reveal the
additional constraint that the entropy bound places on the degrees of freedom.
At the beginning of evolution on the planet, the first living system may be
denoted by an initial state $s_0$. When the degrees of freedom of $\mathcal{N}$
are greater than $I_{univ}$, not all the states of $\mathcal{N}$ can have a
causal relationship with $s_0$ unless the holographic principle is untrue. We
define $\mathcal{U}$ as the set of all states that have a causal relationship
with $s_0$; it has $n_s$ states and is only a proper subset of $\mathcal{S}$
(Fig. 1b). The nodes of $\mathcal{U}$ constitute $\mathcal{L}$, which is a
subset of $\mathcal{V}$. An actual living system at present corresponds to a
dynamical system evolving only in the state space $\mathcal{U}$, and a
biologically meaningful protein sequence must belong to $\mathcal{L}$. The
degrees of freedom of a living system can therefore be defined by the number of
states in $\mathcal{U}$:

$$d = \ln n_s, \qquad (4)$$

where $n_s$ is much less than $n'_s$, so that $d$ can indeed be less than
$I_{univ}$.

FIG. 2: The finite degrees of freedom require language of sequences. a, A toy
model can explain the necessity of the language in sequences. Suppose that the
entropy bound requires that the information cannot be greater than 2 bits in a
tiny universe. There is no living system corresponding to the Boolean network
with 3 nodes aa, ab and ba. We can choose a subset, aa and ab, as the nodes of
an available Boolean network, which corresponds to an actual living system in
this universe. Thus we obtain a language consisting of 2 words: aa and ab. b,
Only the states in the set $\mathcal{U} \subset \mathcal{S}$ have a causal
relationship with the initial state $s_0$, owing to the entropy bound. We can
obtain a language $\mathcal{L}$ of the sequences, which is a subset (dots on
$\mathcal{N}$) of $\mathcal{V}$. The number of elements in $\mathcal{L}$ must
be less than $I_{univ}$.
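The toy model in the caption of Fig. 2 can be enumerated explicitly. The sketch below assumes that the information of a Boolean network is counted as the base-2 logarithm of its number of states, so a network with k nodes carries k bits; the function names are hypothetical.

```python
from itertools import combinations


def network_information_bits(nodes) -> int:
    """A Boolean network with k nodes has 2**k states, i.e. k bits of information."""
    return len(nodes)


def admissible_languages(words, bound_bits):
    """All non-empty subsets of `words` whose Boolean network respects the bound."""
    return [
        set(subset)
        for size in range(1, len(words) + 1)
        for subset in combinations(words, size)
        if network_information_bits(subset) <= bound_bits
    ]


words = ["aa", "ab", "ba"]
languages = admissible_languages(words, bound_bits=2)
```

With a bound of 2 bits the full three-word network is excluded, while two-word languages such as aa and ab are allowed, which is the selection described in the caption.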



THE LANGUAGE REQUIRED BY THE
HOLOGRAPHIC PRINCIPLE

Causality provides a physical explanation for distinguishing a part
$\mathcal{L}$ of the sequences from all possible sequences $\mathcal{V}$. Not
all amino acid chains or base chains are meaningful in biology. According to
the theory of formal languages, a language is defined as a subset of all the
sequences formed by concatenating letters of a given alphabet [12]. The choice
of a subset $\mathcal{L}$ of $\mathcal{V}$ is a natural way to define a formal
language (Fig. 2). The protein and DNA languages originate in the constraint
that the entropy bound places on the degrees of freedom of life. The alphabet
of the protein language consists of 20 amino acids, and the alphabet of the
language of genes consists of 4 bases. The arrangement of the letters in the
sequences should be determined by some grammar. Although there are various
entropy bounds, they do not differ in requiring finite degrees of freedom in
life and a language of genes, whichever theory is adopted. To some extent, the
language of genes is a consequence of the principles of quantum gravity. The
phenomenon of life is strictly constrained by the entropy bound. The ordering
of sequences required by the grammar cannot be explained in the context of
classical physics, because there the degrees of freedom of life can be
infinite.

The human ability to speak is determined by genes. That we can communicate
with each other instinctively can be attributed to our common genes. Human
language can be viewed as a transformation of cell language [13]. The
information storage capacity of a natural language can also be estimated by a
calculation similar to the one above. For instance, we estimate that up to
$I_{human} = 26^{l'}$ bits of information can be written in a language with 26
letters whose word length is $l' \sim 10$, which is much less than the protein
length. In this sense, natural language is simpler than the language of genes.
The value $I_{human}$ is much less than the information in the observed
universe, $I_{univ}$, so the description of the universe by natural language is
always a simplified version of the actually complex world. Interestingly, it
has not been rare in the history of the natural sciences for different routes
to reach the same goal, as with Riemannian geometry and general relativity, or
the theory of bundles and gauge theories. Such encounters may arise because the
descriptions in different subjects share a common ultimate theory of all the
information in the universe, although we cannot understand all the details of
the world through any single subject.

THE LANGUAGE OF GENES AND
UNDERLYING ORDER IN SEQUENCES

Several attempts have been made over the past three decades to combine
linguistic theory with biology [14 1 [THj l. The distribution of the number of
occurrences of protein domains in a genome is well fitted by the power-law
distribution known in linguistics as Zipf's law, and protein linguistics can be
distinguished from the language of genes according to the theory of formal
languages [15]. Experimental observations thus support the existence of
languages in the sequences of macromolecules. On the one hand, these languages
are required by the holographic principle. On the other hand, they are
consequences of the evolution of life at the molecular level 3 12 1 HI HI-. The
alphabets of amino acids and bases formed at the beginning of life, and the
genetic code developed and became fixed in the early stage of evolution. All
these factors can determine whether a sequence is permitted in a living system,
which is equivalent to the role of grammar at the molecular level.

We found strong evidence of an underlying mechanism in the organization of
amino acids in protein sequences by studying the correlations between protein
length distributions, which points to languages in the protein sequences. The
protein length distribution corresponds to a vector

$$\mathbf{D} = (D(1), D(2), \ldots, D(g), \ldots, D(c)), \qquad (5)$$

where there are $D(g)$ proteins of length $g$ in the complete proteome of a
species and $c = 3000$ is the cutoff of protein length. Our protein length
distributions are obtained from the 106 complete proteomes in the database
Predictions for Entire Proteomes [20]. The discrete Fourier transform of the
protein length distribution is

$$D(f) = \sum_{g=1}^{c} D(g)\, e^{2\pi i (g-1)(f-1)/c}. \qquad (6)$$

Let $f_m$ denote the frequency of the highest peak $D(f_m)$ in the discrete
Fourier transform of the protein length distribution of a species. We found an
interesting relationship between the frequency $f_m$ and the average protein
length $\bar{l}$ of a species. The distribution of species in the
$\bar{l}$-$f_m$ plane shows a regular pattern: the species of the three domains
(Archaebacteria, Eubacteria and Eukaryotes) gather in three rainbow-like arches
respectively (Fig. 3). This pattern strongly indicates an intrinsic correlation
among the protein length distributions, which could never arise if the protein
length distributions were stochastic. The periodic-like fluctuations in the
protein length distribution [21] may also originate in the underlying mechanism
of generation of protein sequences.

FIG. 3: Relationship between the average protein length $\bar{l}$ and the
frequency $f_m$ of the highest peak of the discrete Fourier transform of the
protein length distribution. The species from the three domains are
distributed like a rainbow. Even for a group of closely related species such as
the mycoplasmas (belonging to the eubacteria), the distribution also forms an
"arch" of the rainbow. This is strong evidence for an underlying mechanism of
the protein length distributions.
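The two quantities plotted against each other, the average protein length and the peak frequency of the transform in Eq. (6), could be computed along the following lines. This is a minimal sketch using only the standard library. Two choices are assumptions of ours, since the text does not specify them: the peak search skips the constant term (f = 1) and stops at the Nyquist frequency. The function names are also ours.

```python
import cmath
import math


def average_length(D):
    """Mean protein length; D[g-1] is the number of proteins of length g."""
    total = sum(D)
    return sum((g + 1) * count for g, count in enumerate(D)) / total


def dft(D):
    """Direct evaluation of Eq. (6); list index k corresponds to f = k + 1."""
    c = len(D)
    return [
        sum(D[g] * cmath.exp(2j * math.pi * g * k / c) for g in range(c))
        for k in range(c)
    ]


def peak_frequency(D):
    """1-based frequency f_m of the largest |D(f)|, skipping the constant term."""
    c = len(D)
    spectrum = dft(D)
    # Assumption: ignore f = 1 (k = 0) and frequencies above the Nyquist limit.
    k_best = max(range(1, c // 2 + 1), key=lambda k: abs(spectrum[k]))
    return k_best + 1


# Synthetic distribution with 8 oscillations over a cutoff of 64 (not real data).
synthetic = [10 + 5 * math.cos(2 * math.pi * 8 * g / 64) for g in range(64)]
f_m = peak_frequency(synthetic)
```

The direct sum costs on the order of c squared operations, so for c = 3000 a library FFT routine would be the more practical choice. The sign of the exponent does not affect the peak position, because only the magnitude of each term is compared.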



EXPLANATION OF THE ORDER IN PROTEIN 
SEQUENCES 

We propose a model, based on tree adjoining grammar [22], to reveal the
underlying mechanism in protein sequences. In the model, protein sequences are
generated by tree adjoining operations, i.e., substituting the initial tree or
auxiliary trees into each other by identifying the inner nodes (Fig. 4a) [22].
There is only one variable $t$ in the model; it is the probability of
substitution in the adjoining operations, and different values denote different
species. A certain number of proteins can be generated when $t$ is fixed, and
hence we obtain a protein length distribution from the model (Fig. 4b). The
properties of protein length distributions can be explained by the simulation:
the outline and the fluctuations of the simulated protein length distribution
agree in principle with the actual protein length distributions.

We show that there is a close relationship between the protein length
distributions and the grammar rules. The fluctuations in the distributions are
determined by the grammar rules: the same grammar rules give the same
distribution, and changing the grammar rules gives different outlines and
fluctuations of the distribution. This result suggests that the fluctuations in
actual protein length distributions are intrinsic properties of particular
species and may reflect the underlying mechanism behind the order of protein
sequences.
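The dependence of the length distribution on the grammar rules can be illustrated with a deliberately simplified stochastic growth process. This is not the tree adjoining grammar of the model, whose trees are not specified in the text. It is a hypothetical toy in which each adjoining step happens with probability t and adds a fixed number of leaves; the leaf counts are arbitrary illustrative values.

```python
import random
from collections import Counter


def generate_length(t, rng, initial_leaves=3, auxiliary_leaves=(2, 5), max_steps=10_000):
    """Length of one toy sequence.

    Starts from `initial_leaves`; with probability t another auxiliary tree is
    adjoined, adding one of the leaf counts in `auxiliary_leaves`; otherwise
    growth stops. All numbers are illustrative assumptions.
    """
    length = initial_leaves
    for _ in range(max_steps):
        if rng.random() >= t:
            break
        length += rng.choice(auxiliary_leaves)
    return length


def length_distribution(t, n_proteins, seed=0, **grammar):
    """Counter mapping length -> number of generated sequences with that length."""
    rng = random.Random(seed)
    return Counter(generate_length(t, rng, **grammar) for _ in range(n_proteins))


if __name__ == "__main__":
    print(sorted(length_distribution(0.9, 1000).items())[:10])
```

In this toy, t sets the overall outline of the distribution (larger t gives longer sequences), while the leaf counts determine which lengths can occur at all, so changing the rules changes the fine structure of the distribution.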



THE MACROEVOLUTION OF BIOLOGICAL 
COMPLEXITY 

The evolution of the complexity of life is not a linear course of increase
[23][24]. The entropy bound can also explain the leaps in the evolution of
biological complexity, and so we can outline the macroevolution of life. Gene
regulatory networks are accelerating networks [25][26]. According to this
theory, the evolution of the complexity of any accelerating network has to slow
down and will stop at an upper limit of complexity. Hence there must be upper
limits of complexity in the evolution of both prokaryotes and eukaryotes, for
which the entropy bound is a natural upper limit. The whole evolution of
biological complexity can therefore be divided into three steps: the evolution
of unicellular life, the evolution of multicellular life and the evolution of
human society. The Cambrian explosion divided the first two steps. We found
that the evolution of multicellular life has reached its upper limit, because
the maximum non-coding DNA content is close to 1 at present. Human civilization
then appeared, which can be taken as an alternative form of biological
complexity. The entire evolution of biological complexity should be governed by
a universal mechanism of evolution. The universal language of genes in species
may harmonize the evolution of life in the biosphere.

FIG. 4: Simulation of protein length distributions by a linguistic model. a,
The tree adjoining grammar. There are one initial tree and two auxiliary trees,
where S and T are inner nodes and x or x x are leaves, which represent the
amino acids. b, The simulation of the protein length distribution by the tree
adjoining grammar (horizontal axis: protein length). The properties of protein
length distributions, such as the outline and fluctuations, can be simulated by
the linguistic model.